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Abstract 

The G-protein coupled receptor (GPCR) superfamily is currently the largest class of ther- 
apeutic targets. In silico prediction of interactions between GPCRs and small molecules is 
therefore a crucial step in the drug discovery process, which remains a daunting task due to the 
difficulty to characterize the 3D structure of most GPCRs, and to the Hmited amount of known 
Hgands for some members of the superfamily. Chemogenomics, which attempts to character- 
ize interactions between all members of a target class and all small molecules simultaneously, 
has recently been proposed as an interesting alternative to traditional docking or ligand-based 
virtual screening strategies. 

We propose new methods for in silico chemogenomics and validate them on the virtual 
screening of GPCRs. The methods represent an extension of a recently proposed machine learn- 
ing strategy, based on support vector machines (SVM) , which provides a flexible framework to 
incorporate various information sources on the biological space of targets and on the chemical 
space of small molecules. We investigate the use of 2D and 3D descriptors for small molecules, 
and test a variety of descriptors for GPCRs. We show fo instance that incorporating informa- 
tion about the known hierarchical classification of the target family and about key residues in 
their inferred binding pockets significantly improves the prediction accuracy of our model. In 
particular we are able to predict ligands of orphan GPCRs with an estimated accuracy of 78.1%. 



1 Introduction 



The G-protein coupled receptor (GPCR) superfamily is comprised of an estimated 600-1,000 mem- 
bers and is the largest known class of molecular targets with proven therapeutic value. They are 
ubiq uitous in our body, being involved in regulation of every major mammalian physiological sys- 
tem ( Bockaert and Fid , and play a role in a wide range of disorders including allergies, 
cardiovascular dysfunct ion, depression, obesity, can c er, pain, d i abetes, and a variety o f cent ral ner- 
vous system disorders ( Deshpande and Penn . 2006 : Hilj 2006; Catapano and Manji . 2007). They 
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are integral membrane proteins sharing a common global topology that consists of seven trans- 
membrane alpha helices, an intracellular C-terminal, an extracellular N-terminal, three intracellular 
loops and three extracellular loo ps. There ar e four main classes of GPCRs (A, B, C and D) depend- 
ing on their sequence similarity ( Horn et al. . 20031 ). Their location on the cell surface makes them 
readily accessible to drugs, and 30 GPCRs are the targets for the majorit y of best-selling drugs , 



representing about 40% of all prescription pharmaceuticals on the market (IFredholm et all 1200 



Besides, the human genome contains several hundred unique GPCRs which have yet to be assigned 
a clear cellular funct ion, suggesting t hat t hev are likely to remain an important target class for new 
drugs in the future (|Lin and Civelli l2004l l. 

Predicting interactions in silico between small molecules and GPCRs is not only of particular 
interest for the drug industry, but also a useful step for the elucidation of many biological process. 
First, it may help to decipher the function of so-called orphan GPCRs, for which no natural ligand 
is known. Second, once a particular GPCR is selected as a target, it may help in the selection of 
promising molecule candidates to be screened in vitro against the target for lead identification. 

In silico virtual screening of GPCRs with classical approaches is however a daunting task for 
at least two reasons. First, the 3D structures are currently known for only two GPCRs (bovine 
rhodopsin and human /?2-adrenergic receptor). Indeed, GPCRs, like other membrane proteins, are 
notoriously difficult to crystallize. As a result, docking strategies for screening small molecules 
against GPCRs are often limited by the difficulty to model correctly the 3D structure of the tar- 
get. To circumvent the lack of experimental structures, various studies have used 3D structural 
models of GPCRs built by homology modeling using bovine rhodopsin as a template structure. 
Docki ng a library of molecules i nto these modeled structures allowed the recovery of known lis 



ands ( Evers and Klabunde . 20051 ). and even identification of new ligands ( Cavasotto et al 



200, 



However, docking methods still suffer from docking and scoring inaccuracies, and homology models 
are not always reliable-enough to be employed in target-based virtual screening. Methods have 
been proposed to enhance the quality of the models by global optimization and fiexible dock- 



ing ( Cavasotto et al. . 20031 ). or by using different sets of receptor models. Nevertheless, these 



methods are expected to show limited performances for GPCRs sharing low sequence similarity 
with rhodopsin, especially in the case of receptors belonging to classes B, C and D. Alternatively, 
ligand-based strategies, also known as quantitative structure-activity relationship (QSAR), attempt 
to predict new ligands from previously known ligands, often using statistical or machine learning 
approaches. Ligand-based approaches are interesting because they do not require the knowledge of 
the target 3D structure and can benefit from the discovery of new ligands. However, their accuracy 
is fundamentally limited by the amount of known ligands, and degrades when few liga nds are known . 



Although these methods were successfully used to retrieve strong GPCR binders (iRolland et al 



2OO5I ). they are efficient for lead optimization within a previously identified molecular scaffold, but 



are not appropriate to identify new families of ligands for a target. At the extreme, they cannot be 
pursued for the screening of orphan GPCRs. 

Instead of focusing on each individual target independently from other proteins, a recent trend in 
the pharmaceutical industry, often referred to a s chemoqenomics, is t o screen molecules against sey - 
eral targets of the same family simultaneously ( Kubinyi et al. . 2004 : Jaroch and Weinmann . 20061 ). 
This systematic screening of interactions between the chemical space of small molecules and the 
biological space of protein targets can be thought of as an attempt to fill a large 2D interaction 
matrix, where rows correspond to targets, columns to small molecules, and the (i, j)-th entry of the 
matrix indicates whether the j-th molecule can bind the i-th target. While in general the matrix 
may contain some description of the strength of the interaction, such as the association constant of 
the complex, we will focus in this paper on a simplified description that only differentiates binding 
from non-binding molecules, which results in a binary matrix of target-molecule pairs. This matrix 
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is already sparsely filled with our current knowledge of protein-ligand interactions, and chemoge- 
nomics attempts to fill the holes. While classical docking or ligand-based virtual screening strategies 
focus on each single row independently from the others in this matrix, i.e., treat each target in- 
dependently from each others, the chemogenomics approach is motivated by the observation that 
similar molecules can bind similar proteins, and that information about a known interaction be- 
tween a ligand and a GPCR could therefore be a useful hint to predict interaction between similar 
molecules and similar GPCRs. This can be of particular interest when, for example, a particular 
target has few or no known ligands, but similar proteins have many: in that case it is tempting to 
use the information about the known ligands of similar proteins for a ligand-based virtual screening 
of the target of interest. In this context, we can formally define in silico chemogenomics as the 
problem of predicting interactions between a molecule and a ligand (i.e., a hole in the matrix) from 
the knowledge of all other known interactions or non-interactions (i.e., the known entries of the 
matrix). 

Recent reviews ( Kubinyi et al. . 2004 : Jaroch and Weinmann . 20061 : Klabunde . 2007 : Rognan . 



20071 ) describe several strategie s for in si/ico chemogenomics. A first class of approaches, called 



ligand-based chemogenomics by Rognan ( 20071 ). pool together targets at the level of families (such 
as GP CR) or subfami l ies (such as purinergic GPCR) and learn a model for ligands at the level of the 
family JBalakin et aZl.l2002l:lKlabundel . [2006l l. Other approaches, termed target-based chemogenomic 
approaches by iRognanI (j2007l ). cluster receptors based on ligand bi nding site similaritv and again 
pool together known ligands for each cluster to inf er shared ligan ds (jFrimurer et Finally, 
a third strategy termed target-ligand approach by Rognan (j2007l ) attempts to predict ligands for a 
given target by leveraging binding information for other targets in a single step, that is, without 
first attempting t o def ine a particular set of similar receptors. This strategy was pioneered by 
Bock and Q^m^ hmi ) to predict ligands of orphan GPCR. They merged descriptors of ligands and 



targets to describe putative ligand-receptor complexes, a nd used SVM t o disc riminate real complexes 
from ligand-receptors pairs that do not form complexes. Erhan et al. (j2006l ) followed a similar idea 
with different descriptors, and showed in particular that the SVM formulation allows to generalize 
the use of vectors of descriptors to the use of positive definite kernels to describe the chemical and 
the biological space in a computationally efficient framework. Erhan et al. (j2006l ) were not able to 
show, however, significant benefits with respect to the individual approach that learns a separate 
classifier for each GPCR (except in the case of orphan GPCRs, for which their approach performed 
better than the baseline random classifier). Recently, i n the context o f pred icting interactions 
between peptides and different alleles of MHC-I molecules, Jacob and Vert ( 20081 ) followed a similar 
approach and highlighted the importance of choosing adequate descriptors for small molecules and 
targets. They obtained state-of-the-art prediction accuracy for most MHC-I allele, in particular for 
those with few known binding peptides. 

In this paper we go one step further in this direction and present an in silico chemogenomics 
approach specifically tailored for the screening of GPCRs, although the r nethod could in pr i nciple 
be adapted to other classes of therapeut i c targ ets. We follow the idea of Bock and Cough (j2005l ) 
and the algorithmic trick of lErhan et al\ (|2006l ). which allows us to systematically test a variety of 
descriptors for both the molecules and the GPCRs. We test two families of 2D and 3D descriptors 
to describe molecules, including a new 3D kernel, and six ways to describe GPCRs, including a 
description of their relative positions in current hierarchical classifications of the superfamily, and 
information about key residues li kelv to be in contac t with the ligand. We test the approach on 



the data of the GLIDA database ( Okuno et l2006l ). which contains 34686 reported interactions 



between human GPCRs and small molecules, and observe that the choice of the descriptors has a 
significant impact on the accuracy of the models. In particular, the best results are reached when 
using the description of GPCRs within the hierarchical classification of the superfamily, combined 
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with a set of 2D descriptors of small molecules. This allows us to obtain dramatic improvements of 
the prediction accuracy with respect to the individual learning setting. In an experiment where we 
simulate the prediction of ligands for orphan GPCRs, we obtain accuracies of 78.1%, significantly 
above the 50% baseline accuracy of a random predictor. 



2 Method 



In this section, we first review the methods proposed by Bock and Gough ( 20051 ): Erhan et al. (j2006l ) 



for in silico chemogenomics with SVM, before presenting the particular descriptors we propose to 
use for molecules and GPCRs within this framework. 

2.1 In silico chemogenomics with machine learning 

We consider the problem of predicting interactions between GPCRs and small molecules. For this 
purpose we assume that a list of target/small molecule pairs {(fi, mi), . . . , m„)}, known to 
interact or not, is given. Such information is often available as a result of systematic screening 
campaigns in the pharmaceutical industry, or on dedicated databases. Our goal is then to create a 
model to predict, for any new candidate pair {t,m), whether the small molecule m is likely to bind 
the GPCR t. 

A general method to create the predictive model is to follow these four steps: 

1. Choose Titar descriptors to represent each GPCR target t in the biological space by a ntar- 
dimensional vector ^tar{t) = ($tar(*)> • • • > ^ta?''i't))] 

2. In parallel, choose Umoi descriptors to represent each molecule m in the chemical space by a 
rzmoz-dimensional vector ^raoi{m) = (^>^„,(m), . . . , ^^™;°'(m)); 

3. Derive a vector representation of a candidate target/molecule complex ^pair(t,m) from the 
representations of the target ^tarit) and of the molecule ^moK™-); 

4. Use a statistical or machine learning method to train a classifier able to discriminate be- 
tween binding and non-binding pairs, using the training set of binding and non-binding pairs 

{^pairitl,mi), . . . ,^pairitn,mn)} 

While the first two steps (selection of descriptors) may be specific to each particular chemogenomics 
problem, the last two step s define the particular strategy used for in silico chemogenomics. For 
example. Bock and Goughl ( 200ll . 20051 ) proposed to concatenate the vectors ^tar{t) and $moK"^) 
to obtain a (nja^ -|-nmoi)"diniensional vector represe ntation of th e ligaii d-target complex ^pair{t,rn) 



(|2006l ) followed a slightly different 



strategy for the third step, by forming descriptors for the pair {t, m) as product of small molecule 
and target descriptors. More precisely, given a molecule m described by a vector $moK"^) ^^^d a 
GPCR t described by a vector ^tar(t), the pair (t,m) is represented by the tensor product: 

^pair{t,m) = ^tar{t) «) '^moli'm) , (1) 

that is, a {nt ar ^ JT-moZ )~dimensional vector whose entries are products of the form ^fg^^it) x *^^q;(^)) 
for 1 < i < ntar and 1 < j < rimoi- A SVM is then used as an inference engine, to estimate a 
linear function f{t,m) in the vector space of target/molecule pairs, that takes positive values for 
interacting pairs and negative values for non-interacting ones. 
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The main motivation for using the tensor product ([T]) is that it provides a systematic way to 
encode correlations between small molecule and target features. For example, in the case of binary 
descriptors, the product of two features is 1 if both the molecule and the target descriptors are 1, 
and zero otherwise, which amounts to encode the simultaneous presence of particular features of the 
molecule and of the target that may be important for the formation of a complex. A potential issue 
with this approach, however, is that the size of the vector representation ritar x iT-mol for a pair may 
be prohibitively large for practical computation and manipulation. For example, using a vector of 
molecular descriptors of size 1024 for molecules, and representing a protein by the vector of counts of 
all 2-mers of amino-acids in its sequence {dt = 20 x 20 = 400) results i n more than 400k dimensions 
for the representation of a pair. As pointed out by Erhan et al. ( 20061 ). this computational obstacle 



can however be overcome when a SVM is used to train the linear classifier, thanks to a trick often 
referred to as the kernel trick. Indeed, a SVM does not necessarily need the explicit computation of 
the vectors representing the complexes in the training set to train a model. What it needs, instead, 
is the inner products between these vectors, and a classical property of tensor products is that the 
inner product between two tensor products ^pair{t,rn) and ^pairit' ,m') is the product of the inner 
product between ^tar(t) and ^tarit'), on the one hand, and the inner product between ^moi{i^) 
and ^rnol{rn'), on the other hand. More formally, this property can be written as follows: 

= '^tar{tV'^tar{t') X <^mol{m)^ '^mol{m') , (2) 

where uTv = uivi + . . . + UdVd denotes the inner product between two d-dimensional vectors u and 
V. In other words, the SVM does not need to compute the ntar x rimoi vectors to describe each pair, 
it only computes the respective inner products in the target and ligand spaces, before taking the 
product of both numbers. 

This flexibility to manipulate molecule and target descriptors separately can moreover be com- 
bined with other tricks that sometimes allow to compute efficiently the inner products in the target 
and ligand spaces, respectively. Many s uch inner products, al so called kernels ^ have been developed 
recently both in cora p utational biology (Scholkopf et al. . 20041 ) and chemistry ( Kashima et al. . 2003 : 



Gartner et all . l2003l : iMahe et al\ . l2005l ). and can be easily combined within the chemogenomics 
framework as follows: if two kernels for molecules and targets are given as: 

Kmolim,m') = ^rnolim)~'^^rnol{m), 
Ktar{t,t') = <^>taritV'^tar{t'), 

then we obtain the inner product between tensor products, i.e., the kernel between pairs, by: 

{m,m'). (4) 

In summary, as soon as two vectors of descriptors or kernels K^g and Ktar are chosen, we can 
solve the in silico chemogenomics problem with an SVM using the product kernel Jl]) between pairs. 
The particular descriptors or kernels used should ideally encode properties related to the ability of 
similar molecules to bind similar targets or ligands respectively. 

In the next two subsections, we present different possible choices of descriptors - or kernels - 
for small molecules and GPCRs, respectively. 

2.2 Descriptors for small molecules 

The problem of explicitly representing and storing small molecules as finite- dimensional vectors 
has a l ong history in chemoinfor r aatics , and a multitude of molecular descriptors have been pro- 
posed ( Todeschini and Consonni . 20021 ). These descriptors include in particular physicochemical 
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properties of the molecules, such as its solubility or logP, descriptors derived from the 2D structure 
of the molecule, s uch as fragment counts or structural fingerprints, or descriptors extracted from 
the 3D structure ( Gasteiger and Engej 20031 ). Each classical fingerprint vector and vector repre- 
sentation of molecules define an explicit "chemical space" in which each molecule is represented 
by a finite- dimensional vector, a nd these vector repres entations can obviously be used as such to 



define kernels between molecules ( Azencott et al. . 200?! ). Alternatively, some authors have recently 



proposed some kernels that generalize some of these sets of descriptors and correspond to inner prod- 
ucts between large- or even infinite- dimensional vectors of descriptors. These descriptors encode. 



of the molecules (Kashima et al 



2004: Gartner et al. 


. 2003: Mahe et al. 200^) 


jMahe et al. 


. 200fil: 


Azencott et al. 20071). 



or various features 



In this study we select two existing kernels, encoding respectively 2D and 3D structural infor- 
mation of the small molecules, and propose a new 3D kernel: 

• The 2D Tanimoto kernel. Our first set of descriptors is meant to characterize the 2D structure 
of the molecules. For a small molecule m, we define the vector ^moiiiT^) as the binary vector 
whose bits indicate the presence or absence of all linear graph of length u or less as subgraphs 
of the 2D structure of /. We chose u = 8 in our experiment, i.e., characterize the molecules 
by the occurrences of linear subgraphs of length 8 or less, a value previouslv observed to 
give good results in several virtual screening tasks JMahe et fl/.l . I200RI 1. Moreover, instead of 
directly taking the inner product between vectors as in ([3]), we use the Tanimoto kernel: 



-^ligandij : I ) 



(5) 



which was proven to be a valid inner product by Ralaivola et all ( 20051 ). giving very competitive 
results on a variety of QSAR or toxicity prediction experiments. 

3D pharmacophore kern el While 2D structure s are known to be very competitive in ligand- 
based virtual screening (lAzencott et "^■ l2007l ). we reasoned that some specific 3D conforma- 
tions of a few atoms or functional groups may be responsible for the interaction with the target. 
Thus, we decided to test descriptors representing the presence of potent ial 3-point pharma - 
cophores. For this, we used the 3D pharmacophore kernel proposed by Mahe et~al\ (j2006i ). 
that generalizes 3D pharmacophore fingerprint descriptors. This approach implies the choice 
of a 3D conformer for each molecule. In absence of sufficient data available for bound ligands 
in GPCR structures, we chose to build a 3D version of the ligand base in which molecules 
are represented in an estimated minimum energy conformation. For each of the 2446 retained 
ligands, 25 conformers were generated with the Omega program (OpenEye Scientific Software) 
using standard parameters, except for a 1 RMSD clustering of the conformers, instead of the 
0.8 default value. A 3D ligand base was generated by keeping the conformer of lowest energy 
for each ligand. Partial charges were calculated for all atoms using the molcharge program 
(OpenEye Scientific Software) with standard param eters. This l i gand base was then used to 



calculate a 3D pharmacophore kernel for molecules (jMahe et al. . 2006h . 



We used the freely and publicly available ChemCPI^ software to compute the 2D and 3D 
pharmacophore kernel. 



^ Available at |http : / / chemcpp . sourcef orge . net [ 
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2.3 Descriptors for GPCRs 

SVM and kernel methods are also widely used in bioinformatics ([Scholkopf et iaZI, [200i), and a 



variety of approaches have been proposed to design kernels between proteins, ranging from kernels 



based on the 3D structures of proteins (jPobson and Doigl . l2005l : iBorgwardt et all l2005l: iQiu et al 



based on the amin o -acid s equence of a prote i n (iJaakkola et a/.. 200 Ql:lLeslie et a/.l. 2002l:lTsuda et al., . 
20021 : iLeslie et al\ . 120041 : IVert et n,l\ . l2004l: iKuang et fl/.l 12005.: Cutiiri and VerfJ l2005l) to kernels 



20071 ) or on the pattern of occurrences of proteins in multiple sequenced genomes (IVertl . l2002l ). These 
kernels have been used in conjunction with SVM or other kernel methods for various tasks related 
to structural or functional classification of proteins. While any of these kernels can theoretically 
be used as a GPCR kernel in ([H), we investigate in this paper a restricted list of specific kernels 
described below, aimed at illustrating the flexibility of our framework and test various hypothesis. 



The Dirac kernel between two targets t,t' is: 



Koiracit, t') 



1 if t = t' , 
otherwise. 



(6) 



This basic kernel simply represents different targets as orthonormal vectors. From dlj we 
see that orthogonality between two proteins t and t' implies orthogonality between all pairs 
{l,t) and {l',t') for any two small molecules c and c'. This means that a linear classifier for 
pairs (/, t) with this kernel decomposes as a set of independent linear classifiers for interactions 
between molecules and each target protein, which are trained without sharing any information 
of known ligands between different targets. In other words, using Dirac kernel for proteins 
amounts to performing classical learning independently for each target, which is our baseline 
approach. 

The multitask kernel between two targets t, t' is defined as: 



multitask 



{t,t') = l + KDirac{t,t') 



This kernel, originally proposed in the context of multitask learning lEvgeniou et 
removes the orthogona lity of different proteins to allow sharing of information. As explained in 
Evgeniou et plugging K^ultitask ill © amounts to decomposing the linear function 

used to predict interactions as a sum of a linear function common to all GPCRs and of a linear 
function specific to each GPCR: 



w 



A consequence is that only data related to the the target t are used to estimate the specific 
vector wt, while all data are used to estimate the common vector w general- In our framework 
this classifier is therefore the combination of a target-specific part accounting for target-specific 
properties of the ligands and a global part accounting for general properties of the ligands 
across the targets. The latter term allows to share information during the learning process, 
while the former ensures that specificities of the ligands for each target are not lost. 

• The hierarchy kernel. Alternatively we could propose a new kernel aimed at encoding the 
similarity of proteins with respect to the ligands they bind. In the GLIDA database indeed, 
GPCRs are grouped into 4 classes based on sequence homology and functional similarity: the 
rhodopsin family (class A), the secretin family (class B), the metabotropic family (class C) and 
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some smaller classes containing other GPCRs. The GLIDA database further subdivides each 
class of targets by type of ligands, for example amine or peptide receptors or more specific 
families of ligands. This also defines a natural hierarchy that can be used to compare GPCRs. 

The hierarchy kernel between two GPCRs was therefore defined as the number of common 
ancestors in the corresponding hierarchy plus one, that is, 

Khierarchy{t,t') = {<^h{t) , '^h{t')) , 

where ^h{t) contains as many features as there are nodes in the hierarchy, each being set to 1 
if the corresponding node is part of i's hierarchy and otherwise, plus one feature constantly 
set to one that accounts for the "plus one" term of the kernel. 

The binding pocket kernel. Because the protein-ligand recognition process occurs in 3D 
space in a pocket involving a limited number of residues, we tried to describe the GPCR 
space using a representation of this pocket. The difficulty resides in the fact that although 
the GPCR sequences are known, the residues forming this pocket and its precise geome- 
try are a priori unknown. However, the two available X-Ray structures, together with 
mutagen esis data showed t hat t he binding pockets are situated in a similar region for all 
GPCRs JKratochwil et all boO^ ). In order to identify residues potentially involved in the 



binding pocket of GPCRs of unknown structure studied in this work, we proceeded in several 
steps, (a) The two kno wn structures (PDB entrie s 1U19 and 2RH1) were superimposed using 
the STAMP algorithm (iRussell and Bartonl . In the superimposed structures, the retinal 



and 3-(isopropylamino)propan- 2-ol ligands are very close, which is in agreement with global 
conservation of binding pockets, as shown on Figured) (b) The structural alignment of bovine 
rhodopsin and of human /32-adrenergic receptor was used to generate a sequence alignment of 
these two proteins, (c) For both structures, in order to identify residues potentially involved in 
stabilizing interactions with the ligand (residues of the pocket), we selected residues that pre- 
sented at least one atom situated at less than 6 from at least one atom of the ligand. Figure [2] 
shows that these two pockets clearly overlap, as expected, (d) Residues of the two pockets 
(as defined in (c)) were labeled in this structural sequence alignment. These residues were 
found to form small sequence clusters that were in correspondence in this alignment. These 
clusters were situated mainly in the apical region of transmembrane segments and included a 
few extracellular residues, (e) All studied GPCR sequences, inclu ding bovine rhodopsi n and 



of human /32-adre nergic receptor were al i gned using CLUSTALW (jChenna et all l2003l ) with 



Blosum matrices (jHenikoff and Heniko For each protein, residues in correspondence 



with a residue of the binding pocket (as defined above) of either bovine rhodopsin or human 
/32- adrenergic receptor were retained. This lead to a different number of residues per protein, 
because of sequence variability. For example, in extracellular regions, some residues from 
bovine rhodopsin or human /32-adrenergic receptor had a corresponding residue in some se- 
quences but not in others. In order to provide a homogeneous description of all GPCRs, in the 
list of residues initially retained for each protein, only residues situated at positions conserved 
in almost all GPCRs were kept, (f) Each protein was then represented by a vector whose 
elements corresponded to a potential conserved pocket. This description, although appearing 
as a linear vector filled with amino acid residues, implicitly codes for a 3D information on the 
receptor pocket, as illustrated on Figure [2l These vectors were then used to build a kernel 
that allows comparison of binding pockets. The classical way to represent motifs of constant 
length as fixed length vectors is to encode the letter at each position by a 20-dimensional 
binary vector indicating which amino acid is present, resulting in a 180-dimensional vector 
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representations. In terms of kernel, the inner product between two binding pocket motifs in 
this representation is simply the number of letters they have in common at the same positions: 

I 

Kpb{x,x') = ^6{x[i],x'[i]), 

i=l 

where I is the length of the binding pocket motifs (31 in our case), x[i] is the i-th residue in 
X and <5(x[z], is 1 if x[i] = x'[i], otherwise. This is the baseline pocket binding kernel. 
Alternatively, using a polynomial kernel of degree p over the baseline kernel is equivalent, in 
terms of feature space, to encoding p-order interactions between amino acids at different posi- 
tions. In order to assess the relevance of such non-linear extensions we tested this polynomial 
pocket binding kernel, 

Kppb{x, x) = {Kpb{x, x') + l)^ . 

We only used a degree p = 2, although a more careful choice of this parameter could further 
improve the performances. 

• The binding pocket hierarchy kernel. Because of the link between binding pockets and ligand 
recognition, we also defined a new hierarchy based on the sequence alignment of the binding 
pocket amino acid vectors without gaps. To do this, we used a PAM matrix with high values 
of gap insertion and extension to compare each couple of GPCR vectors. The obtained scores 
were used in UPGMA (Unweighted Pair Group Method with Arithmetic mean) to determine 
a binding pocket similarity based hierarchy. We obtained a tree comparable to phylogenetic 
trees, and that happens to be share many substructures with the GLIDA hierarchy. 




Figure 1: Representation of the binding pocket of /32-adrenergic receptor (in red) and bovine 
Rhodopsin (in black) viewed from the extracellular surface. On the center of the pocket, 3- 
(isopropylamino)propan-2-ol and cis-retinal have been repres ented to show the size a nd the position 



of the pocket around each ligand. Figure drawn with VMD (jHumphrey et 
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HELIX 5 



HELIX 7 



Figure 2: 3-(isopropylamino)propan-2-ol and the protein environment of /52-adrenergic recep- 
tor as viewed from the extracellular surface. Amino acid side chains are represented for 6 of 
the 31 residues (in cyan, blue and red) of the binding pocket motif. Transmembranes helix 
and 3- fisopropylamino)propan -2-ol are colored in black and red respectively. Figure drawn with 
VMD iniimphrev e.t all\W9(i ). 



3 Data 



We used the GLIDA GPCR-hgand database (|Okuno et all , bood ) which includes 22964 known lig 



ands for and 3738 GPCRs from human, rat and mouse. The ligand base contains highly diverse 
molecules, from ions and very small molecules up to peptides. In order to eliminate unwanted 
molecules such as inorganic compounds and molecules with unsuitable molecular weights, we fil- 
tered the GLIDA ligand base using the filter program (OpenEye Scientific Software) with standard 
parameters. The most important filtering feature here was to keep molecules of molecular weights 
ranging from 150 Da to 450 Da. Overall, the GLIDA ligand base was filtered in order to retain 
molecules that had the physico- chemical characteristics of drugs. This filter retained 2688 molecules. 
Because the GLIDA ligand base contains a few duplicates, we eliminated these redundancies, which 
lead to 2446 different molecules, available under a 2D description files and giving 4051 interactions 
with the human GPCRs. Elimination of duplicates present in the GLIDA base was important here 
because it could have lead to overfitting in the learning step. For each positive interaction given 
by this restricted set, we generated a negative interaction involving the same receptor and one of 
the ligands that was in the database and was not indicated as one of its ligands. This probably 
generated some false negative points in our benchmark, and it would be interesting to use experi- 
mentally tested negative interactions. We loaded the sequences of all GPCRs that are able to bind 
any of these ligands, which lead to 80 sequences, all corresponding to human GPCRs. In the GLIDA 
database, GPCRs are classified in a hierarchy (as mentioned above) which was also loaded for use 
in the hierarchy kernel. 

4 Results 

We ran two different sets of experiments on this dataset in order to illustrate two important points. 
In a first set of experiments, for each GPCR, we 5-folded the data available, i.e. the line of the 
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2D Tanimoto 


3D pharmacophore 


Dirac 


86.2 ± 1.9 


84.4 ± 2.0 


multitask 


88.8 ± 1.9 


85.0 ±2.3 


hierarchy 


93.1 ± 1.3 


88.5 ±2.0 


binding pocket 


90.3 ± 1.9 


87.1 ± 2.3 


poly binding pocket 


92.1 ± 1.5 


87.4 ± 2.2 


binding pocket hierarchy 


93.0 ± 1.4 


90.0 ±2.1 



Table 1: Prediction accuracy for the first experiment with various hgand and target kernels. 



interaction matrix corresponding to this GPCR. The classifier was trained with four folds and the 
whole data from the other GPCRs, i.e., all other lines of the interaction matrix. The prediction 
accuracy for the GPCR under study was then tested on the remaining fold. The goal of these first 
experiments was to evaluate if using data from other GPCRs improved the prediction accuracy for 
a given GPCR. In a second set of experiments, for each GPCR we trained a classifier on the whole 
data from the other GPCRs, and tested on the data of the considered GPCR. The goal was to 
assess how efficient our chemogenomics approach would be to predict the ligands of orphan GPCRs. 
In both experiments, the C parameter of the SVM was selected by internal cross validation on the 
training set among 2*, i € {—8, —7, ...,5,6}. 

For the first experiment, since learning an SVM with only one training point does not really 
make sense and can lead to "anti-learning" less than 0.5 performances, we set all results r involving 
the Dirac GPCR kernel on GPCRs with only 1 known ligand to max(r, 0.5). This is to avoid any 
artefactual penalization of the Dirac approach and make sure we measure the actual improvement 
brought by sharing information across GPCRs. 




Figure 3: GPCR kernel Gram matrices {Ktar) for the GLIDA GPCR data with multitask, hierarchy, 
binding pocket and binding pocket hierarchy kernels. 

Table [T] shows the results of the first experiments with all the ligand and GPCR kernel combi- 
nations. For all the ligand kernels, one observes an improvement between the individual approach 
(Dirac GPCR kernel, 86.4%) and the baseline multitask approach (multitask GPCR kernel, 88.4%). 
The latter kernel is merely modeling the fact that each GPCR is uniformly similar to all other 
GPCRs, and twice more similar to itself. It does not use any prior information on the GPCRs, 
and yet, using it improves the global performance with respect to individual learning. Using more 
informative GPCR kernels further improves, sometimes considerably, the prediction accuracy. In 
particular, the hierarchy kernels add more than 4.5% of precision with respect to naive multitask 
approach. All the other informative GPCR kernels also improve the performance. The polynomial 
binding pocket kernel and the pocket binding hierarchy kernels are almost as efficient as the hierar- 
chy kernel, which is an interesting result. Indeed, one could fear that using the hierarchy kernel, for 
the construction of which some knowledge of the ligands may have been used, could have introduced 
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bias in the results. Such bias is certainly absent in the binding pocket kernel. The fact that the 
same performance can be reached with kernels based on the mere sequence of GPCRs' pockets is 
therefore an important result. Figure [3] shows four of the GPCR kernels. The baseline multitask is 
shown as a comparison. Interestingly, many of the subgroups defined in the hierarchy can be found 
in the binding pocket kernel, that is, they are retrieved from the simple information of the binding 
pocket sequence. This phenomenon is even more visible for the binding pocket hierarchy kernel that 
is based on the hierarchy built from the binding pocket alignment scores. 




Figure 4: Improvement (as a performance ratio) of the hierarchy GPCR kernel against the Dirac 
GPCR kernel as a function of the number of training samples available. Restricted to [2 — 200] 
samples for the sake of readability. 

The 3D kernel for the ligands, on the other hand, did not perform as well as the 2D kernel. This 
can be either explained by the fact the the pharmacophore kernel is not suited to this problem, or 
by the fact that choosing the conformer of the ligand is not a trivial task. This point is discussed 
below. 

Figure |4] illustrates how the improvement brought by the chemogenomics approach varies with 
the number of available training points. As one could have expected, the strongest improvement is 
observed for the GPCRs with few (less than 20) training points {i.e., less than 10 known ligands since 
for each known ligand an artificial non-ligand was generated). When more training points become 
available, the improvement is less important, and sharing the information across the GPCRs can 
even degrade the performances. This is an important point, first because, as showed on Figure [5j 
many GPCRs have few known ligands (in particular, 11 of them have only two training points), 
and second because it shows that when enough training points are available, individual learning will 
probably perform as well as or better than our chemogenomics approach. 

Our second experiment intends to assess how our chemogenomics approach can perform when 
predicting ligands for orphan GPCRs, i.e., with no training data available for the GPCR of interest. 
Table [2] shows that in this setting, individual learning performs random prediction. Naive multitask 
approach does not improve much the performance, but informative kernels such as hierarchical and 
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Figure 5: Distribution of the number of training points for a GPCR. Restricted to [2 — 200] samples 
for the sake of readability. 



Ktar\Kiig 2D Tanimoto 3D pharmacophore 



Dirac 


50.0 ±0.0 


50.0 ±0.0 


multitask 


56.8 ±2.5 


58.2 ±2.2 


hierarchy 


77.4 ± 2.4 


76.2 ±2.2 


binding pocket 


78.1 ±2.3 


76.6 ± 2.2 


poly binding pocket 


76.4 ±2.4 


74.9 ± 2.3 


binding pocket hierarchy 


75.5 ± 2.4 


76.5 ±2.2 



Table 2: Prediction accuracy for the second experiment with various ligand and target kernels. 



binding pocket kernels achieve 77.4% and 78.1% of precision respectively, that is, almost 30% better 
than the random approach one would get when no data is available. Here again, the fact that the 
binding pocket kernel that only uses the sequence of the receptor pocket performs as well as the 
hierarchy-based kernel is encouraging. It suggests that given a receptor for which nothing is known 
except its sequence, it is possible to make reasonable ligand predictions. 



5 Discussion 

We showed how sharing information across the GPCRs by considering a chemogenomics space of the 
GPCR-ligand interaction pairs could improve the prediction performances. In addition, we showed 
that using such a representation, it was possible to make reasonable predictions even when no ligand 
was known for a given GPCR, that is, to predict ligands for orphan GPCRs. Our approach is simply 
to apply well known machine learning methods in the constructed chemogenomics space. We used a 
systematic way to build such a space by combining a given representation of the ligands with a given 
representation of the GPCRs into a binding-prediction-oriented GPCR-ligand couple representation. 
This allows to use any ligand or GPCR descriptor or kernel existing in the chemoinformatics or 
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bioinformatics literature, or new ones containing other prior information as we tried to propose in 
this paper. Our experiments showed that the choice of the descriptors was crucial for the prediction, 
and more sophisticated features for either the ligands or the GPCRs could probably further improve 
the performances. 

In all experiments, 3D pharmacophore kernels did not reach the performance of 2D kernels for 
the ligands. This is apparently in contradiction we the idea that protein-ligand interaction is a 
process occurring in the 3D space, and that introduction of 3D information should increase the 
performance. Different explanations can be proposed. The choice of the low energy conformer 
was guided by the following idea. Because only two ligand conformations bound to GPCR re- 
ceptors are available, it was not possible to derive any general information that could be used to 
choose a potential bioactive 3D conformer for each molecule of the ligand base. In this context, 
the only possible reasonable assumption was that, while interaction with the receptor will certainly 
perturb the conformational energy surface of a flexible ligand, high affinity would be observed for 
ligands that bin d in a conform ation that is not exceptionally different from a local free state en- 
ergy minimum (jBostroml . l20Qll ). Although there exists a large number of methods for exploring 
the conformation space of a molecule, we used the Omega program that performs rapid systematic 
conformer search, b ecause it has been sh owed to present good performances for retrieving bioac- 
tive conformations ( Bostrom et al. . 20031 ). However, the set of parameters used to run Omega in 
this study (because of calculation time limitations) may not have allowed to reach a local energy 
minimum: generating a larger number of conformers, with a smaller RMSD clustering value may 
have helped to find better energy minima, and this could be further evaluated. Moreover, some 
studies report that the bioactive conformation of a molecule can differ from the minimum energy 
conformation, and that significant strain energies can indeed be found for molecules in complex 
with proteins ( Perola and Charifson . 20041 ). We cannot rule out the possibility that this is the case 
for GPCR ligands. In the future, resolution of additional 3D structures in this family will help 
to clarify this point. One possible improvement of the method could be to use homology models 
for the GPCRs, dock the ligand base in the modeled binding pockets, and build a 3D ligand base 
using, for each molecule, the conformer associated to the best docking solution. In other families 
of proteins, enzymes for example, where many structures are available and can be used to define 
bioactive conformers, the 3D pharmacophore kernel is expected to improve performance, as observed 
in a previous pure ligand-based study involving ligands i n a given series, f or which the bioactive 
conformation can be inferred from a known 3D structure ( Mahe et al. . 20061 ). 



Various evidence suggest that, within a common global architecture, a generic binding pocket 
mainly involving transmembrane regions hosts agonists, antagonists and allosteric modulators. In 
order to identify this pocket automaticall y, other studies report t he use of sequence alignment and 
the prediction of transmembrane helices. Kratochwil et al. ( 20051 ) detected hypervariable positions 
in transmembrane helices for identification of residues forming the binding pocket. The under- 
lying idea was that conserved residues were probably important for structure stabilization, while 
variable positions were involved in ligand binding, in order to accommodate the wide spectrum of 
molecules that are GPCR substrates. Using this method, they proposed potential binding pock- 
ets for GPCRs, and f ound that the cor r espon ding residues were frequently in the GRAP mutant 
database for GPCRs ( Kristiansen et al. . 19961 ). Interestingly, these authors pointed that residues 
corresponding to these hypervariable positions were found within a distance of 6 from retinal in 
the rhodopsin X-Ray structure. Therefore, although we used a different method to automatically 
extract binding pocket residues in the GPCR families, our results are in good agreement with this 
study. 

Interesting developments of this method could include application to quantitative prediction 
of the binding affinities, that would be straightforward using regression algorithms in the same 
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chemogenomics space. Anot her possibility is app l icatio n to other important drug target families, 
like enzymes or ion channels ([Hopkins and Grooml . for which most of the descriptors used for 



the GPCRs in this paper could directly be used, and other, more specific ones could b e designed . 



Frora a methodological point of view, many rec ent developments in multitask learning (jVert et al 



2006 : Argyriou et al. . 2007 : Bonilla et al. . 20081 ) could be applied to generalize this chemogenomics 



approach using, for example, other regularization methods. 
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Gamma-aminobutyric acid type B receptor 
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Table 3: Residues of 5-hydroxytryptamine 5A receptor, Adenosine A2b receptor, Gamma-aminobutyric acid type B receptor and Relaxin 
3 receptor 2 (shown as examples) aligned with /32- adrenergic receptor binding site amino acids, the binding pocket motif of /32-adrenergic 
receptor has been used as reference to determine residues involved in the formation of the binding site of the 79 other GPCRs. Bold 
columns correspond to the residues shown on Figure [2l 



